
Conversation


@ARF1 ARF1 commented Mar 6, 2015

Speedup of ca. 1.5x vs. the master branch on my machine with compressed bcolz.
Approximate contributions (a sketch of the first and third points follows below):

- 1/3: direct indexing into arrays using typed memoryviews
- 1/3: substitution of the reverse dict with std::vector objects
- 1/3: use of the nested `with nogil`, `with gil` construct

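To make the first and third points concrete, here is a minimal Cython sketch. It is not the bquery code: the function, its toy workload, and the `progress` list are invented for illustration. It shows direct C-level indexing through typed memoryviews, and a `with nogil` hot loop that re-acquires the GIL only for occasional Python-level work:

```
import numpy as np

cimport cython
cimport numpy as cnp


@cython.boundscheck(False)
@cython.wraparound(False)
def add_one_chunked(cnp.int64_t[:] values, list progress, Py_ssize_t chunk=4096):
    """Toy example: add 1 to every element, reporting progress per chunk."""
    cdef Py_ssize_t i, n = values.shape[0]
    cdef cnp.int64_t[:] out = np.empty(n, dtype=np.int64)

    with nogil:                      # hot loop runs without the GIL
        for i in range(n):
            out[i] = values[i] + 1   # typed memoryview: direct C-level indexing
            if i > 0 and i % chunk == 0:
                with gil:            # re-acquire the GIL only for rare Python work
                    progress.append(i)
    return np.asarray(out)
```

The point of the nesting is that the expensive per-element work stays GIL-free, while the rare Python-level interaction remains correct.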
ARF1 (Author) commented Mar 6, 2015

Uncompressed bcolz timings on my machine:

```
bquery master:
In [3]: %timeit -r 10 a.cache_factor(['mycol'], refresh=True)
1 loops, best of 10: 2.46 s per loop

pull request:
In [3]: %timeit -r 10 a.cache_factor(['isin'], refresh=True)
1 loops, best of 10: 1.59 s per loop

==> Factor: 1.5
```

Compressed bcolz timings on my machine:

```
bquery master:
In [3]: %timeit -r 10 a.cache_factor(['mycol'], refresh=True)
1 loops, best of 10: 4.03 s per loop

pull request:
In [3]: %timeit -r 10 a.cache_factor(['mycol'], refresh=True)
1 loops, best of 10: 3.13 s per loop

==> Factor: 1.3
```

Possible additional optimizations (probably fairly minor):

- There should be no need to store reverse_keys in _factorize_str_helper, since it is merely an increasing sequence of integers up to reverse_values.size - 1. (I wanted to keep the code logic as close to the original as possible.) A sketch of this idea follows the list.
- In factorize_str, the reverse Python dictionary (a hash table with expensive insertion, I think) is created only to be thrown away after the creation of carray_values. (Changing this would have obvious knock-on effects on other helper functions.)
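As a rough sketch of the first point (hypothetical names, not the actual _factorize_str_helper; it assumes Cython's standard automatic conversion between bytes and std::string): because factor codes are assigned as 0, 1, 2, ..., the reverse mapping from code to value is just positional indexing into a std::vector, so no separate reverse_keys container is needed:

```
# distutils: language = c++
from libcpp.vector cimport vector
from libcpp.string cimport string

cdef class ReverseValues:
    # index into the vector == factor code, so no reverse_keys needed
    cdef vector[string] values

    cpdef size_t add(self, bytes value):
        # append a newly seen value; its code is simply its position
        self.values.push_back(value)
        return self.values.size() - 1

    cpdef bytes get(self, size_t code):
        # O(1) positional lookup instead of a dict lookup
        return self.values[code]
```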

ARF1 pushed a commit to ARF1/bquery that referenced this pull request Mar 6, 2015
Shared variables with manual locking:
- hash table
- count
- reverse_keys
- reverse_values
- out_buffer
- chunk_

Shared variables without locking requirement:
- locks

Thread-local variables:
- thread_id
- in_buffer_ptr (points to thread-local buffer)
- out_buffer_ptr (points to thread-local buffer)

Locking scheme (a simplified sketch follows below):
- For each thread, a lock on the hash table (and the other associated shared variables) exists.
- Each thread processing a chunk begins by acquiring its own lock on the shared hash table.
- The lock is released when the thread encounters a value that is new to the hash table.
- Once the thread is ready to write to the hash table, it waits to acquire the locks of all threads.
- After the write, all locks are released.
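A simplified, hypothetical sketch of this scheme, using Python threading.Lock objects for readability; the actual code presumably uses lower-level locks, and holds the per-thread lock for a whole chunk rather than per lookup:

```
import threading

class SharedTable:
    # Hypothetical names; per-thread locks guard one shared hash table.
    def __init__(self, num_threads):
        self.locks = [threading.Lock() for _ in range(num_threads)]
        self.table = {}

    def lookup_or_insert(self, thread_id, key):
        my_lock = self.locks[thread_id]
        my_lock.acquire()            # reads only need the thread's own lock
        code = self.table.get(key)
        my_lock.release()
        if code is not None:
            return code
        # write path: acquire *all* locks (in a fixed order, avoiding
        # deadlock) so that no other thread is mid-lookup during the write
        for lock in self.locks:
            lock.acquire()
        try:
            # setdefault re-checks: another thread may have inserted the
            # key between releasing our lock and acquiring all of them
            code = self.table.setdefault(key, len(self.table))
        finally:
            for lock in self.locks:
                lock.release()
        return code
```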

---
Uncompressed bcolz timings:
```
--- uncached unique() ---
pandas (in-memory):
In [10]: %timeit -r 10 c.unique()
1 loops, best of 10: 881 ms per loop

bquery master over bcolz (persistent):
In [12]: %timeit -r 10 a.unique('mycol')
1 loops, best of 10: 2.1 s per loop
==> x2.38 slower than pandas

pull request over bcolz (persistent):
In [8]: %timeit -r 10 a.unique('mycol')
1 loops, best of 10: 834 ms per loop
==> x1.05 FASTER than pandas

---- cache_factor ---
bquery master over bcolz (persistent):
In [3]: %timeit -r 10 a.cache_factor(['mycol'], refresh=True)
1 loops, best of 10: 2.51 s per loop

pull request with 2 threads over bcolz (persistent):
In [3]: %timeit -r 10 a.cache_factor(['mycol'], refresh=True)
1 loops, best of 10: 1.16 s per loop
==> x2.16 faster than master

pull request with 1 thread over bcolz (persistent):
In [3]: %timeit -r 10 a.cache_factor(['mycol'], refresh=True)
1 loops, best of 10: 1.69 s per loop
==> x1.48 faster than master (c.f. x1.48 from single-threaded PR visualfabriq#21)
==> parallel code seems to have no performance penalty on single-core machines
```

Compressed bcolz timings:
```
--- uncached unique() ---
pandas (in-memory):
In [10]: %timeit -r 10 c.unique()
1 loops, best of 10: 881 ms per loop

bquery master over bcolz (persistent):
In [12]: %timeit -r 10 a.unique('mycol')
1 loops, best of 10: 3.39 s per loop
==> x3.85 slower than pandas

pull request over bcolz (persistent):
In [8]: %timeit -r 10 a.unique('mycol')
1 loops, best of 10: 1.9 s per loop
==> x2.16 slower than pandas

---- cache_factor ---
bquery master over bcolz (persistent):
In [5]: %timeit -r 10 a.cache_factor(['mycol'], refresh=True)
1 loops, best of 10: 4.09 s per loop

pull request with 2 threads over bcolz (persistent):
In [5]: %timeit -r 10 a.cache_factor(['mycol'], refresh=True)
1 loops, best of 10: 2.48 s per loop
==> x1.65 faster than master

pull request with 1 thread over bcolz (persistent):
In [5]: %timeit -r 10 a.cache_factor(['mycol'], refresh=True)
1 loops, best of 10: 3.26 s per loop
==> x1.25 faster than master (c.f. x1.28 from single-threaded PR visualfabriq#21)
```
FrancescElies pushed a commit to FrancescElies/bquery that referenced this pull request Mar 16, 2015 (same commit message as above).